Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

taxonomic ID, taxonomic rank (kingdom, genus, family, etc.), genome size in bp, number

of reads classified to this genomic sequence including multi-classified reads, number of

reads uniquely classified to this genomic sequence, and abundance proportion as shown in

Figure 8.2. The Centrifuge report shows that the metagenomic reads have been assigned to

taxonomic groups at different ranks. This report can be further analyzed to filter the most

significant taxa based on their summary statistics.
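
As a minimal sketch of such filtering, the function below keeps only the taxa whose number of

uniquely classified reads reaches a chosen threshold and sorts them from most to least supported.

It assumes a tab-delimited report with a header line and the unique-read count in the sixth

column, following the column order described above. The function name, the file name

centrifuge_report.tsv, and the threshold of 100 reads are hypothetical choices for illustration.

```shell
# Sketch: filter a Centrifuge-style report by number of uniquely classified reads.
# Assumed columns (tab-delimited, with header):
#   name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance
filter_centrifuge_report() {
    report_file="$1"
    min_unique="${2:-100}"   # hypothetical default threshold

    # Print the header line unchanged.
    head -n 1 "$report_file"

    # Keep rows whose 6th column meets the threshold, then sort
    # numerically (descending) on that same column.
    tail -n +2 "$report_file" \
        | awk -F '\t' -v min="$min_unique" '$6 + 0 >= min + 0' \
        | sort -t "$(printf '\t')" -k6,6nr
}
```

For example, `filter_centrifuge_report centrifuge_report.tsv 100 > filtered_report.tsv` would write

the filtered table to a new file. The tab delimiter matters because taxon names contain spaces.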

8.2.4 Assembly of Metagenomes

In assembly-free microbial profiling, we could assign a taxonomic group to the metagenomic

sequences and obtain some useful statistics. In contrast, assembly-based profiling requires

metagenome assembly, which faces challenges not present in the assembly of a single genome

discussed in Chapter 2. Metagenomic data includes reads from several microbes with different

sequence coverage rather than the uniform coverage that typical single-genome assemblers

assume. Assuming uniform coverage allows an assembler to distinguish true sequences from

errors, identify repeat sequences, and identify allelic variation. That assumption is invalid

in metagenome assembly because the coverage of the genome of each species in the sample

depends on the abundance of that species in the sample. Since metagenome assembly is

performed with a de novo approach that uses a de Bruijn graph, low sequence coverage may

lead to broken (non-contiguous) paths. Assemblers can mitigate that problem by using a

shorter k-mer size, but shorter k-mers are more likely to recur at unrelated positions, which

tangles the graph and compromises the assembly quality.
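
The k-mer size trade-off can be illustrated with a toy sketch. The function below counts how

many distinct k-mers occur more than once in a single sequence, which is a rough proxy for the

ambiguity a de Bruijn graph would have to resolve. The function name and the ten-base example

string are hypothetical, and the sketch ignores reverse complements and sequencing errors, so it

is only an illustration and not how an assembler works internally.

```shell
# Sketch: count distinct k-mers that appear more than once in a sequence.
count_repeated_kmers() {
    dna_seq="$1"
    kmer_size="$2"
    printf '%s\n' "$dna_seq" | awk -v k="$kmer_size" '
    {
        # Slide a window of length k along the sequence and tally each k-mer.
        for (i = 1; i + k - 1 <= length($0); i++)
            count[substr($0, i, k)]++
    }
    END {
        repeated = 0
        for (kmer in count)
            if (count[kmer] > 1)
                repeated++
        print repeated
    }'
}

# Compare several k-mer sizes on a toy sequence.
for k in 3 4 5; do
    echo "k=$k repeated_kmers=$(count_repeated_kmers ATGCGATGCA "$k")"
done
```

On this toy string the number of repeated k-mers falls from 2 at k=3 to 1 at k=4 and 0 at k=5,

which mirrors why shorter k-mers make the graph more ambiguous.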

Another challenge in metagenome assembly is the presence of substrains of the same bacterial

species, which makes the graph more complex. Metagenome assemblers attempt to overcome

these challenges using different strategies. As an example, we will use metaSPAdes [7], which

builds its assembly graph from a de Bruijn graph, works across a wide range of coverage

depths, and attempts to balance the accuracy and contiguity of the metagenome assemblies.

metaSPAdes is one of the SPAdes programs that we discussed in Chapter 3. Refer to that

chapter for SPAdes installation. If you followed the installation instructions of SPAdes in

Chapter 3, you will have added its path to the “.bashrc” file. To check that the program is

installed and on the path, run the following:

metaspades.py

This displays the usage and options of the metaSPAdes program. If it does not, you may need

to install the program by following the installation instructions.
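
If you prefer an explicit check over reading the usage text, a small helper like the following

can report whether a program is on the path. It relies only on the shell built-in `command -v`.

The helper name check_on_path is a hypothetical choice for this sketch.

```shell
# Sketch: report whether a program can be found on the PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}
```

Running `check_on_path metaspades.py` prints the program location when it is found, or returns a

non-zero status with a message when it is not.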

The following metaSPAdes command will perform de novo metagenome assembly using

the metagenomic FASTQ files as input:

mkdir metag_healthy

metaspades.py \

-o metag_healthy \

-1 fastq_pure/ERR1823587_pure_R1-50.fastq.gz \